Successive Convex Approximation Based Off-Policy Optimization for Constrained Reinforcement Learning


Abstract

Constrained reinforcement learning (CRL), also termed safe reinforcement learning, is a promising technique for enabling the deployment of RL agents in real-world systems. In this paper, we propose a successive convex approximation based off-policy optimization (SCAOPO) algorithm to solve the general CRL problem, which is formulated as a constrained Markov decision process (CMDP) in the context of the average cost. The SCAOPO is based on solving a sequence of convex objective/feasibility optimization problems obtained by replacing the objective and constraint functions in the original problem with convex surrogate functions. The proposed SCAOPO enables the reuse of experiences from previous updates, thereby significantly reducing the implementation cost when deployed in engineering systems that need to learn the environment online. In spite of the time-varying state distribution and the stochastic bias incurred by the off-policy learning, the SCAOPO with a feasible initial point can still provably converge to a Karush-Kuhn-Tucker (KKT) point of the original problem almost surely.
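The objective/feasibility structure described in the abstract can be illustrated with a generic successive convex approximation step. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation: it assumes quadratic convex surrogates built from stochastic value and gradient estimates (which, in SCAOPO, would be formed by reusing off-policy experiences), and the helper name `sca_step`, the proximal weight `tau`, and the smoothing step `gamma` are hypothetical. It uses cvxpy to solve each convex surrogate problem.

```python
# Minimal SCA-style update sketch for  min_theta J_0(theta)  s.t.  J_i(theta) <= 0, i = 1..m.
# All names (sca_step, tau, gamma) are illustrative, not from the paper.
import numpy as np
import cvxpy as cp

def sca_step(theta, f_vals, grads, tau=1.0, gamma=0.5):
    """One surrogate objective/feasibility update around the current iterate.

    theta  : current policy parameter, shape (d,)
    f_vals : estimates of J_0, ..., J_m at theta, shape (m+1,)
    grads  : stochastic gradient estimates of J_0, ..., J_m, shape (m+1, d)
    """
    d = theta.size
    x = cp.Variable(d)
    # Convex quadratic surrogate of each J_i around theta.
    surrogate = lambda i: (f_vals[i] + grads[i] @ (x - theta)
                           + (tau / 2) * cp.sum_squares(x - theta))
    prob = cp.Problem(cp.Minimize(surrogate(0)),
                      [surrogate(i) <= 0 for i in range(1, len(f_vals))])
    prob.solve()
    if x.value is None:
        # Surrogate constraints infeasible: solve a feasibility problem instead,
        # minimizing the worst constraint surrogate via a slack variable.
        s = cp.Variable()
        prob = cp.Problem(cp.Minimize(s),
                          [surrogate(i) <= s for i in range(1, len(f_vals))])
        prob.solve()
    # Smoothed move toward the surrogate solution.
    return (1 - gamma) * theta + gamma * x.value
```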


Similar Articles

Stochastic Successive Convex Approximation for Non-Convex Constrained Stochastic Optimization

This paper proposes a constrained stochastic successive convex approximation (CSSCA) algorithm to find a stationary point for a general non-convex stochastic optimization problem, whose objective and constraint functions are nonconvex and involve expectations over random states. The existing methods for non-convex stochastic optimization, such as the stochastic (average) gradient and stochastic...
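The key stochastic ingredient in such constrained SSCA methods is a recursive estimate of the objective/constraint values and gradients, since the expectations over random states cannot be evaluated exactly. Below is a minimal sketch of that running-average recursion; the weight schedule and all names are illustrative assumptions, not taken from the paper.

```python
# Illustrative running-average recursion for the function/gradient estimates
# used to build the convex surrogates; rho is a diminishing averaging weight.
# The names and the exponent 0.6 are assumptions for illustration only.
def update_estimates(f_hat, g_hat, f_sample, g_sample, t):
    rho = 1.0 / (t + 1) ** 0.6                    # diminishing weight
    f_hat = (1 - rho) * f_hat + rho * f_sample    # running value estimate
    g_hat = (1 - rho) * g_hat + rho * g_sample    # running gradient estimate
    return f_hat, g_hat
```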


Parallel Successive Convex Approximation for Nonsmooth Nonconvex Optimization

Consider the problem of minimizing the sum of a smooth (possibly non-convex) and a convex (possibly nonsmooth) function involving a large number of variables. A popular approach to solve this problem is the block coordinate descent (BCD) method whereby at each iteration only one variable block is updated while the remaining variables are held fixed. With the recent advances in the developments ...
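As a point of reference for the block-coordinate structure described above, here is a minimal cyclic BCD sketch for a smooth term plus a separable nonsmooth convex term; `grad_block` and `prox_block` are hypothetical callables, not functions from the paper.

```python
# Minimal cyclic block coordinate descent sketch for
#     min_x f(x_1, ..., x_B) + sum_b h_b(x_b),
# where f is smooth (possibly non-convex) and each h_b is convex (possibly
# nonsmooth). grad_block and prox_block are illustrative placeholders.
def bcd(blocks, grad_block, prox_block, step=0.1, n_iters=100):
    for t in range(n_iters):
        b = t % len(blocks)                      # pick one block per iteration
        g = grad_block(blocks, b)                # gradient of f w.r.t. block b
        # Proximal gradient step on block b; all other blocks stay fixed.
        blocks[b] = prox_block(b, blocks[b] - step * g, step)
    return blocks
```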


On-Policy vs. Off-Policy Updates for Deep Reinforcement Learning

Temporal-difference-based deep reinforcement learning methods have typically been driven by off-policy, bootstrap Q-learning updates. In this paper, we investigate the effects of using on-policy, Monte Carlo updates. Our empirical results show that for the DDPG algorithm in a continuous action space, mixing on-policy and off-policy update targets exhibits superior performance and stability comp...
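A minimal sketch of what such a mixed update target might look like for a DDPG-style critic is shown below; the blending coefficient `beta` and all function names are assumptions for illustration, not the authors' exact scheme.

```python
# Blend an off-policy one-step bootstrap target with an on-policy Monte Carlo
# return for a DDPG-style critic. beta and all names are illustrative only.
def mixed_critic_target(reward, next_q, mc_return, done, gamma=0.99, beta=0.5):
    bootstrap = reward + gamma * (1.0 - float(done)) * next_q  # TD(0) target
    return beta * mc_return + (1.0 - beta) * bootstrap         # blended target
```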


Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning

In this paper we present a new way of predicting the performance of a reinforcement learning policy given historical data that may have been generated by a different policy. The ability to evaluate a policy from historical data is important for applications where the deployment of a bad policy can be dangerous or costly. We show empirically that our algorithm produces estimates that often have ...
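For context, the simplest estimator of this kind is trajectory-wise importance sampling, sketched below. It is only a building block of data-efficient off-policy evaluation methods, not the estimator proposed in the paper above, and all names are illustrative.

```python
# Trajectory-wise importance sampling estimate of a target policy's return
# from data generated by a different behaviour policy. Placeholder names.
import numpy as np

def is_estimate(trajectories, pi_target, pi_behaviour, gamma=0.99):
    values = []
    for traj in trajectories:                    # traj: list of (s, a, r)
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_target(a, s) / pi_behaviour(a, s)  # likelihood ratio
            ret += (gamma ** t) * r                          # discounted return
        values.append(weight * ret)
    return float(np.mean(values))
```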


Off-Policy Shaping Ensembles in Reinforcement Learning

Recent advances in gradient temporal-difference methods make it possible to learn multiple value functions off-policy, in parallel, without sacrificing convergence guarantees or computational efficiency. This opens up new possibilities for sound ensemble techniques in reinforcement learning. In this work we propose learning an ensemble of policies related through potential-based shaping rewards. The ensembl...
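The shaping mechanism referred to above can be illustrated with the standard potential-based shaping formula, which leaves the optimal policies of the underlying MDP unchanged; the `potential` callable here is a hypothetical placeholder.

```python
# Potential-based reward shaping: r' = r + gamma * phi(s') - phi(s).
# Each ensemble member would use its own potential function phi; the
# `potential` callable is an illustrative placeholder.
def shaped_reward(reward, s, s_next, potential, gamma=0.99, done=False):
    next_phi = 0.0 if done else potential(s_next)
    return reward + gamma * next_phi - potential(s)
```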



Journal

Journal title: IEEE Transactions on Signal Processing

Year: 2022

ISSN: 1053-587X, 1941-0476

DOI: https://doi.org/10.1109/tsp.2022.3158737